Verification of dtypes of columns of X_row* is same that self.X #300

Open
wants to merge 2 commits into
base: master

Conversation

@salmuz salmuz commented Mar 25, 2024

Hello, I want to contribute by fixing two different bugs related to the use of LightGBM.

  1. NaN values in the categorical columns, which raise an exception when the values are passed to sorted(...):
       '<' not supported between instances of 'str' and 'float'
  2. Preserve the dtypes of the columns of X (DataFrame) so that LightGBM's predict(..) function doesn't throw errors.

@salmuz salmuz changed the title Verification of dtypes of columns of X_sample is same that self.X Verification of dtypes of columns of X_row* is same that self.X Mar 25, 2024
@@ -1791,3 +1796,25 @@ def get_xgboost_preds_df(xgbmodel, X_row, pos_label=1):
0, "pred_proba"
]
return xgboost_preds_df


def check_dtype_of(
Owner

Could you add some tests for this? How flexible is it? (E.g. will it break on float32 vs float64? int vs float? etc.)
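To make the flexibility question concrete, here is a minimal sketch of how a strict dtype comparison treats those cases. The helper `dtypes_match` is hypothetical and is not the function from this PR; it only assumes the check compares column dtypes for exact equality, as the `dtypes.eq(...)` call in the diff suggests.

```python
import pandas as pd


def dtypes_match(df_a, df_b, features):
    """Hypothetical helper: True only if every listed column has an identical dtype."""
    return all(df_a[col].dtype == df_b[col].dtype for col in features)


# Reference frame standing in for self.X.
reference = pd.DataFrame({"x": pd.Series([1.0, 2.0], dtype="float64")})

# Candidate rows with the same, a narrower, and an integer dtype.
same = pd.DataFrame({"x": pd.Series([3.0], dtype="float64")})
narrower = pd.DataFrame({"x": pd.Series([3.0], dtype="float32")})
integer = pd.DataFrame({"x": pd.Series([3], dtype="int64")})

results = {
    "float64 vs float64": dtypes_match(same, reference, ["x"]),
    "float32 vs float64": dtypes_match(narrower, reference, ["x"]),
    "int64 vs float64": dtypes_match(integer, reference, ["x"]),
}
print(results)
```

Under exact equality, both float32 vs float64 and int vs float count as mismatches, so a test suite for the new function would probably want one case for each.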

@@ -50,6 +50,7 @@


 from .explainer_methods import *
+from .explainer_methods import check_dtype_of
Owner

You can add check_dtype_of to the __all__ list at the start of explainer_methods.py; then it is covered by the import *. (Generally import * is frowned upon, but it's okay as long as you define a restrictive __all__.)
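As a small illustration of that suggestion, the sketch below shows how `__all__` restricts what `import *` exposes. The module name `demo_methods` and the extra functions are made up for the demo; it does not reproduce the real explainer_methods.py.

```python
import sys
import types

# Source for a stand-in module built in memory (hypothetical contents).
source = '''
__all__ = ["check_dtype_of"]

def check_dtype_of():
    return "exported"

def other_function():
    return "not exported"
'''

# Register the stand-in module so it can be imported by name.
demo_module = types.ModuleType("demo_methods")
exec(source, demo_module.__dict__)
sys.modules["demo_methods"] = demo_module

# Run a star import into a fresh namespace and list what arrived.
namespace = {}
exec("from demo_methods import *", namespace)
exported = sorted(name for name in namespace if not name.startswith("__"))
print(exported)
```

Only the names listed in `__all__` reach the importing namespace, which is why a separate explicit import becomes unnecessary once the new function is listed there.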

@@ -241,7 +242,9 @@ def __init__(
 col for col in self.regular_cols if not is_numeric_dtype(self.X[col])
 ]
 self.categorical_dict = {
-col: sorted(self.X[col].unique().tolist()) for col in self.categorical_cols
+col: sorted(
+v for v in self.X[col].unique().tolist() if not pd.isna(v)
Owner

Not an expert on LightGBM, but wouldn't there be use cases where NaN would be a category? Or is that handled differently? How about in CatBoost or other libraries?
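For context, a minimal sketch of the failure and of two possible ways to handle it. It uses a plain Python list in place of `self.X[col].unique().tolist()`, a hypothetical `is_missing` helper in place of `pd.isna`, and a made-up `MISSING_LABEL` placeholder; none of these come from the PR itself.

```python
import math


def is_missing(value):
    """Hypothetical stand-in for pd.isna on scalar values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


# Unique values of a categorical column that contains a missing entry.
raw_values = ["red", float("nan"), "blue"]

# Sorting strings together with a float NaN raises TypeError.
try:
    sorted(raw_values)
    sort_failed = False
except TypeError:
    sort_failed = True

# Option 1 (what the diff does): drop missing values before sorting.
categories = sorted(v for v in raw_values if not is_missing(v))

# Option 2 (assumption, not in the PR): keep missing as its own explicit category.
MISSING_LABEL = "NaN"  # hypothetical placeholder label
has_missing = any(is_missing(v) for v in raw_values)
categories_with_missing = categories + ([MISSING_LABEL] if has_missing else [])

print(sort_failed, categories, categories_with_missing)
```

Which option is appropriate depends on whether the downstream model treats missing values as a separate category, which is the open question here.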

df_target is not None and
not df_target[features].dtypes.eq(df_origin[features].dtypes).all()
):
df_target[features] = df_target[features].astype(
Owner

In general I'm not a fan of functions that modify their arguments in place. Could you rewrite it so that it returns the transformed df instead? Then maybe call it adjust_dtypes_to_match_df(...) or something?

Calling something check_dtype_of when it actually modifies one of the arguments is confusing.
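One possible shape for such a function, as a sketch only: the name `adjust_dtypes_to_match_df` comes from the suggestion above, while the signature, the `None` handling, and the behaviour are assumptions rather than the PR's final implementation.

```python
import pandas as pd


def adjust_dtypes_to_match_df(df_target, df_origin, features):
    """Return a new DataFrame whose `features` columns use df_origin's dtypes.

    df_target itself is left unmodified (assumed behaviour for this sketch).
    """
    if df_target is None:
        return None
    wanted = {col: df_origin[col].dtype for col in features}
    # DataFrame.astype returns a new object instead of changing df_target.
    return df_target.astype(wanted)


# Reference frame standing in for self.X.
origin = pd.DataFrame(
    {
        "a": pd.Series([1.0, 2.0], dtype="float32"),
        "c": pd.Categorical(["x", "y"]),
    }
)
# A single row whose columns were inferred with different dtypes.
target = pd.DataFrame({"a": pd.Series([1.5], dtype="float64"), "c": ["x"]})

adjusted = adjust_dtypes_to_match_df(target, origin, ["a", "c"])
print(adjusted.dtypes)
```

Because the caller receives the converted frame as a return value, the original `target` keeps its dtypes and the function name describes what actually happens.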

@oegedijk
Owner

Cool, thanks! Tests are passing, but please have a look at my comments and see if you can add a few test cases for this new function.

@salmuz
Author

salmuz commented Mar 28, 2024

Hello, I will make the requested changes as soon as possible (next week). Thanks.


2 participants